Evaluation and Program Planning, Vol. 3, pp. 269-276, 1980 
Printed in the U.S.A. All rights reserved. 


0149-7189/80/020269-08$02.00/0 
Copyright © 1981 Pergamon Press Ltd 


AVOIDING TYPE III ERROR IN PROGRAM EVALUATION 


Results From a Field Experiment 


DOUGLAS DOBSON 


Northern Illinois University 


and 


THOMAS J. COOK 


Research Triangle Institute 


ABSTRACT 


A common, yet questionable assumption underlying many evaluations of service intervention 
programs is that program clients uniformly receive the services purportedly available. The 
authors draw upon the experience of a randomized field experiment to point out the hazards of 
that assumption. They found marked differences among clients in the amount of actual service 
received during participation in the program evaluated. Moreover, the data suggest that pro- 
gram outcomes varied as a function of the amount of service received. These findings are of- 
fered as a cautionary note to other evaluators; the amount of service actually received by clients 
should be accurately recorded and incorporated into the analyses of program outcomes. 


The development of policy evaluation as an area of in- 
quiry has focused the attention of researchers, ad- 
ministrators, and legislators alike on the outcomes of 
public programs. In the private sector, the ubiquitous 
profit and loss statement serves as a generally accepted 
measure of performance. Lacking a conceptual or em- 
pirical equivalent for public sector programs, substan- 
tial effort has been made to establish performance stan- 
dards which would provide insights regarding the extent 
to which program goals are actually achieved. 
Moreover, there have been significant developments in 
the methodology of program evaluation over the past 
few years. Due largely to the work of scholars such as 
Campbell and Stanley (1966), Cook and Campbell 
(1975), Boruch and Riecken (1974), and Fairweather 
and Tornatzky (1977), evaluators have become increas- 
ingly aware of the need for strong evaluation designs 
which effectively rule out rival interpretations of data. 

By and large, these developments appear to have led to
the conclusion that social programs do not work. That
is, outcomes are not consistent with objectives. In
Rossi’s (1978) words: 


If there is any empirical law that is emerging from the past 
decade of widespread evaluation research activities, it is 
that the expected value for any measured effect of a social 
program is zero. In short, most programs, when properly 
evaluated, turn out to be ineffective or at best marginally 
accomplishing their set aims. There are enough exceptions 
to prevent this empirical generalization from being phrased 
as the “Iron Law of Social Program Evaluation,” but the 
tendency is strong enough to warrant placing bets on 
negative evaluation outcomes in the expectation of making 
a steady but modest side income. (p. 574) 


Requests for reprints should be sent to Thomas J. Cook, Research Triangle Institute, P.O. Box 12194, Research Triangle Park, North Carolina 27709. 

As Rossi points out, these results have led to
reconsiderations of both methods of program evaluation and
social programs themselves. Whether social programs
simply do not work or whether available methods are
insufficiently sensitive is presently not clear. There 
does, however, appear to be increasing concern, among 
some researchers, regarding the question of whether 
outcome-focused evaluations, which treat social pro- 
grams as “black boxes,” are appropriate (Patton, 1978; 
Quay, 1977; Rossi, 1978, 1979; Sechrest et al., 1979). 
The issue here revolves around the notion of treatment 
implementation. Simply put, if treatments are not 
clearly specified and if the services implied by those 
treatments are not delivered in a way which is consistent 
with program objectives, it is likely that evaluation 
results will be less than useful or, perhaps, meaningless. 
More importantly, potentially effective programs may 
be dismantled due to negative results deriving not from 
inefficacious treatment, but from a failure to consider 
whether the treatment was ever delivered at all. Scanlon 
et al. (1977) have characterized this situation in terms of 
“type III error.” 


Statisticians worry about two types of errors . . .: Type I
error is rejecting a hypothesis when it should be accepted;
Type II error is accepting a hypothesis when it should be
rejected. . . . Evaluators commonly make two types of errors in
doing evaluations: Type III error is measuring something
that does not exist; Type IV error is measuring something
that is of no interest to management and policy makers. (p. 36) 


That one might ask whether treatments are actually 
delivered may seem an obvious point. Yet it is not dif- 
ficult to cite examples of evaluations in which 
treatments were nonexistent, inappropriately delivered, 
or highly idiosyncratic. (Rossi, 1978, describes a 
number of cases.) Even when one begins an evaluation 
with a clear understanding of the nature of the treat- 
ment and its relationship to hypothesized outcomes, 
answering the question of whether the treatment was 
implemented can be difficult. Human service pro- 
grams, in particular, may deliver treatments in a variety 
of nonstandardized ways, varying with locales, service 
deliverers, and recipients. Under such conditions, 
assessment of whether the program was implemented, 
or to what extent it was implemented, may be highly 
problematic. 

This article will consider some aspects of program 
implementation in the context of a multi-year ex- 
perimental evaluation of an ex-offender rehabilitation 
program. After briefly describing the research setting 
and evaluation design, some of the problems encoun- 
tered in the specification and measurement of treatment 
implementation will be examined. We shall also con- 
sider the effects of implementation on our ability to 
assess program outcomes in a single performance area. 


THE RESEARCH SETTING 


The Safer Foundation, located in Chicago, Illinois, 
operates several programs serving ex-offenders in the 
metropolitan area. Two are of interest in the context of 
this article. The first, DARE, functions exclusively as a 
job placement program for ex-offenders. The program 
is staffed by job coaches, who counsel clients on effec- 
tive utilization of their existing skills and the develop- 
ment of new skills, and by job developers, who serve as 
the link between ex-offender clients and potential 
employers. 

The second program, Challenge, serves a wider range 
of functions. Specifically, three subprograms make up 
the core of Challenge. First, the Citizen Volunteer
component links ex-offender clients with volunteers from the 
community. After an initial meeting and mutual con- 
currence, clients and volunteers are matched for a 
period of one year. Citizen volunteers may provide 
assistance to clients in a variety of ways. They may, for 
example, simply be “someone to talk with” about everyday
problems. They may also provide direct and referred
guidance in locating jobs, housing, clothes, and other
basic necessities. When necessary, volunteers may
also assist the client in obtaining legal representation. 

The second component of Challenge matches a single 
VISTA volunteer with a small number of clients (usual- 
ly 10 to 15) for a period of three months. These 
volunteers function as part of the Challenge staff and 
are paid by VISTA. The VISTA volunteer, like the 
citizen volunteer, attempts to provide support and ser- 
vices explicitly tailored to the needs of individual 
clients. One of the more interesting facets of this pro- 
gram alternative is the fact that the VISTA volunteers in 
the program are also ex-offenders. The rationale for 
this arrangement is that ex-offenders are more likely to 
be attuned to the difficulties of other ex-offenders who 
are in the initial stages of re-adjustment to life on the 
outside. 

Finally, the Paraprofessional component differs 
from the VISTA alternative in that the length of the 
Paraprofessional-client match is six rather than three 
months. Like the VISTAs, some of the
Paraprofessionals are also ex-offenders. 


OVERVIEW OF THE EXPERIMENTAL DESIGN 


A schematic view of the experimental design for the
evaluation, initiated in late 1977, is shown in Table 1. 

TABLE 1 
EXPERIMENTAL DESIGN FOR THE CHALLENGE EVALUATION 

                       Number of                                  Follow-Up
                       Clients          Program        Program    (Exit +
Group                  Assigned         Intake         Exit       6 Months)
DARE                     106      R     O1       X1    O2         O3
DARE plus
  Citizen Volunteer      106      R     O1       X2    O2         O3
DARE plus VISTA          109      R     O1       X3    O2         O3
DARE plus
  Paraprofessional        46      R     O1       X4    O2         O3

Where: R = random assignment of clients to groups; O = data collection points; X = treatment (i.e., program service) levels. 

As may be seen, four groups of service recipients were
included in the experiment. It should be noted that the
control group in this design is not “pure” in the sense of 
being a “no-treatment” alternative. In this instance, 
ethical considerations clearly precluded such an option. 
Moreover, from a logistical perspective, the extreme 
mobility of ex-offender populations suggested the 
likelihood of an inordinately high attrition rate in a 
pure control group. These considerations led to the 
selection of DARE service recipients as the group 
against which Challenge service alternatives would be 
evaluated. Since all clients normally receive DARE ser- 
vices as part of intake procedures, the treatment alter- 
natives were defined as: (1) DARE services alone; (2) 
DARE services plus the Citizen Volunteer alternative; 
(3) DARE services plus the VISTA alternative; and (4) 
DARE services plus the Paraprofessional alternative. 


Three data collection points were included in the 
design. The first occurred at the point of intake to the 
program. The second occurred when the client exited 
from the program: three months for the DARE-only 
and VISTA groups; six months for the Paraprofes- 
sional group; and one year for the Citizen Volunteer 
group. Follow-up occurred six months after the client 
had completed the treatment alternative. These data 
collection points represent interviews with the client as 
well as the collection of data on recidivism, job reten- 
tion, residential stability, and several other outcome 
measures (for a more complete description see Dobson 
and Cook, 1979; Rezmovic, Cook and Dobson, in 
press; and Cook, Dobson, and Rezmovic, 1980). 


THE CHALLENGE TREATMENTS 


The Challenge program seeks to achieve several goals. 
Central among these are low incidence of recidivism, 
high levels of job placement and retention, and residen- 
tial stability. Treatments are conceptually linked to pro- 
gram goals through “problem-solving” intervention on 
the part of citizen volunteers, VISTAs, and paraprofes- 
sionals. These staff workers are assigned the task of 
assisting clients in setting short range goals and, subse- 
quently, they provide the core of a support network for 
the client as he/she attempts to reach those goals. Thus, 
the treatment is highly diffuse and, in principle, tailored 
to the specific goal-related needs of individual clients. 
If, for example, clients are in need of temporary hous- 
ing, clothes, food, money or medical attention, to take 
a few examples, staff workers may provide direct refer- 
rals, or, perhaps, invoke the resources of Safer’s Sup- 
portive Services unit to meet immediate needs. Alter- 
natively, if the clients’ primary needs relate to job ac- 
quisition, staff workers may actually provide job leads, 
intervene with the DARE program staff, or refer clients 
to other job placement services. As needed, staff 
workers may also assist clients in obtaining appropriate 
legal representation, assist with problems which 
develop between parole officers and clients, or less con- 
cretely, simply be available for “rap” sessions. 


Whatever the specific services, the fundamental
conceptual link between treatment and outcomes is that the
availability of a need-specific resource person is likely
to increase the probabilities of job acquisition, extended
job retention, and residential stability, and to decrease the
probability of recidivism. 

Obviously, under this kind of intervention concept, 
exact specification of what treatments clients did or did 
not receive is, at best, quite difficult. Theoretically, one 
could count the number of job referrals provided by 
staff workers, or perhaps the number of referrals to 
various kinds of social service agencies. One could even 
consider the possibility of tape recording interactions 
between clients and staff workers. As a practical mat- 
ter, however, such data collection efforts are not feasi- 
ble. Meetings between clients and staff workers take 
place at random intervals in restaurants, clients’ homes, 
or even street corners, and few records are kept. In fact, 
program guidelines specify that such meetings should 
take place on “neutral ground” where clients are likely 
to feel “secure.” Even if such data collection efforts had 
been feasible, it is not likely, given the “need-based” 
nature of the treatment, that full comparability could 
have been established. 

Thus, to the extent that treatment in this instance is to
be understood as deriving from the nature of the
interaction of clients and staff, exact specification of the
type or strength of treatment actually received by clients
presents rather intractable problems. This admission
does not, however, preclude the possibility of taking a
somewhat less direct approach to the measurement of
treatment integrity. In the context of this program, 
there were events which had to occur if “treatment” was 
to be delivered to clients at all. In particular: 

1. All clients assigned to one of the Challenge groups 
were to be “matched” with a Citizen Volunteer, 
VISTA worker, or Paraprofessional. Unless clients 
were, in fact, matched, no Challenge services were 
available. Since the match was consummated with an 
agreement signed by the client and staff worker, it 
was possible to learn whether a match ever actually 
took place. 

2. The scheduling of additional meetings between the 
client and staff worker occurred at the point of 
match. Reports were filed after each meeting with 
clients. Thus, it was possible to estimate the delay 
which occurred before the actual initiation of treat- 
ment, i.e., before the first meeting of client and staff 
worker. 


3. Treatments were planned for varying duration. 
Clients in the Citizen Volunteer group were to be 
matched for one year; clients in the VISTA group 
were to be matched for three months; and clients in 
the Paraprofessional group were to be matched for 
six months. The availability of reports subsequent to 
each meeting between staff and clients, along with a 
terminal report filed when the client was dropped, 
meant that estimates of the actual duration of treat- 
ment were possible. 

4. Information regarding the amount of time spent 
with the client was filed after each meeting. Al- 
though admittedly weak, these were the only avail- 
able data which might reasonably be expected to cor- 
relate with intensity of treatment. 

In terms of the range of information that one might 
like to have about the nature of the experimental 
treatments and delivery mechanisms relevant to the 
evaluation, these data are clearly less adequate than one 
might hope. They do, however, appear to reflect at least 
the broad elements of treatment implementation within 
the context of this program. 


TREATMENT IMPLEMENTATION 


With these service delivery criteria in mind, we turn to
the question of whether the Challenge treatment
alternatives were, in fact, implemented in the experimental
groups. In this context, we understand the meaning of
treatment implementation to be: 

1. That the client was actually matched to a staff 
worker. 

2. That, once matched, treatment was initiated within a 
reasonable period of time. Obviously, what consti- 
tutes a “reasonable” time period is somewhat vague. 
Discussions with program staff indicated, however, 
that about five weeks from the date of intake would 
be considered acceptable in the context of program 
operations. For purposes here, we have set 40 days 
from intake as the permissible time lapse prior to 
treatment initiation. 

3. That clients received treatment for the amount of
time appropriate for each group: one year for Citizen
Volunteers (Staff Associates); three months for VISTAs;
and six months for Paraprofessionals. 

4. That clients actually spent a reasonable amount of
time with the staff worker with whom they were
matched. Once again, the problem of establishing a
“reasonable” time period was resolved through
examination of current program practices and
discussions with staff. Although program guidelines do
not establish the number of hours to be spent with
clients, indications were that about five hours of
client-staff contact per month would be regarded as
an acceptable level. 
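
The four criteria above amount to a cumulative checklist, and one way to make the coding rule explicit is a small classifier. The sketch below is an illustration only, not the authors' actual procedure: the record fields, the function name, and the `min_duration_fraction` parameter are hypothetical, while the 40-day and five-hours-per-month thresholds are the values stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientRecord:
    """Hypothetical per-client summary built from match and meeting reports."""
    matched: bool                          # signed match agreement on file
    days_to_first_meeting: Optional[int]   # days from intake; None if no meeting
    months_in_treatment: float             # observed duration of the match
    planned_months: float                  # 12, 3, or 6 depending on group
    contact_hours: float                   # total hours reported across meetings


def implementation_level(client: ClientRecord,
                         max_delay_days: int = 40,
                         min_hours_per_month: float = 5.0,
                         min_duration_fraction: float = 1.0) -> int:
    """Return the highest criterion passed: 0 (never matched) through 4."""
    if not client.matched:
        return 0
    if (client.days_to_first_meeting is None
            or client.days_to_first_meeting > max_delay_days):
        return 1
    if (client.months_in_treatment <= 0
            or client.months_in_treatment
            < min_duration_fraction * client.planned_months):
        return 2
    hours_per_month = client.contact_hours / client.months_in_treatment
    if hours_per_month < min_hours_per_month:
        return 3
    return 4


# A VISTA client first met after 20 days, matched for the full 3 months,
# with 18 contact hours in total (6 hours per month).
example = ClientRecord(True, 20, 3.0, 3.0, 18.0)
print(implementation_level(example))  # prints 4
```

Lowering `min_duration_fraction` below 1.0 loosens the duration criterion while leaving the other three checks untouched.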

Relying upon these criteria, which we have characterized
as “strong,” Figure 1 clearly indicates that the 
program was much less than fully implemented. In fact, 
41% of the clients were never matched with a staff 
worker. These experimental clients never received any 
of the services available through the Challenge pro- 
gram. At the other extreme (Level 4), the data indicate 
that the treatment was “fully” implemented for only 
about one out of every twenty clients (5%). Just over 
one-third (36%) of the clients were matched, had treat- 
ment initiated, but were in the program for less than the 
specified time period. Finally, about 6% of the clients 
met the first three criteria, but did not have a sufficient 
number of contact hours with program staff. 

From a programmatic perspective, these figures 
carry a straightforward message: for the vast majority 
of clients the program was simply not implemented in 
its intended form. Just under half of the experimental 
clients received no treatment at all. Of the remainder, 
many were either not contacted until long after the 
match was established or they dropped out of the pro- 
gram prior to the completion of the treatment period. 

Figure 1. Levels of Treatment Implementation: “Strong” Criteria. 
(Horizontal bar chart; axis: percentage of all experimental clients, 0 to 100.) 

ASSIGNED   100% 
LEVEL 1     59% 
LEVEL 2     36% 
LEVEL 3      6% 
LEVEL 4      5% 

KEY: 
ASSIGNED: clients assigned to an experimental group 
LEVEL 1: clients matched with a staff worker 
LEVEL 2: clients matched with a staff worker and having treatment
initiated within 40 days after match 
LEVEL 3: clients matched with a staff worker; having treatment
initiated within 40 days; and in treatment for specified
time period 
LEVEL 4: clients matched with a staff worker; having treatment
initiated within 40 days; in treatment for specified
time period; and in contact with staff worker at least
five hours per month 

A reasonable argument could be constructed that,
given such results, analysis should proceed no further.
The research design was intended to measure program
outcomes and was predicated on the assumption that a
four-year-old program was sufficiently well-implemented
to sustain an evaluation.* Obviously, that
assumption was ill-founded. Of the 261 clients assigned
to an experimental condition, treatment appears to
have been fully implemented for only 14! 


The difficulty, though, is akin to “throwing out the 
baby with the bath water.” Assuming, for the moment, 
null results (i.e., insignificant differences in treatment- 
control group comparisons), should one conclude that 
the treatments themselves were inefficacious or that 
results were a function of failures in treatment delivery 
mechanisms? In the former case deficiencies are pro- 
bably not correctable, but in the latter, improvement of 
outcome performance may be achievable by simply 
monitoring the imposition of treatment more closely. 

In the next section, we shall attempt to address the 
question of whether treatment integrity affects out- 
come performance. Prior to doing so, however, we 
must briefly reconsider our measure of treatment im- 
plementation. It will be recalled that only 14 clients 
“passed” all implementation criteria. This small 
number of cases is insufficient to sustain analysis. 
Thus, for the purposes here, we have elected to ease the 
third implementation criterion. Specifically, clients 
were categorized as having passed the Level 3 criterion 
if they remained in the program for at least 25% of the 
time projected for their respective treatment groups. 
This procedure increased the percentage of clients pas- 
sing Level 3 from 6% to 25%. Obviously, this results in 
a somewhat “relaxed” measure of treatment implemen- 
tation. It does, however, make analyses tractable. 
Moreover, it seems reasonable to assume that if effects 
are detected with relaxed criteria, stronger criteria 
should produce effects which are at least as strong and, 
perhaps, stronger. 


GROUPS, TREATMENTS AND OUTCOMES 


We begin our analysis by asking whether treatments 
were differentially implemented across the experimen- 
tal groups. As Table 2 shows, the evidence suggests that 
there were systematic differences in treatment imposi- 
tion. Among those clients assigned to the Citizen 
Volunteer group, just under one-third (32.1%) met all 
of the “relaxed” implementation criteria. In contrast, 
full treatment was implemented for only 22.0% of the 
VISTA group and for only about 8.7% of the 
Paraprofessional group. In retrospect, these findings 
seem rather obvious to the authors. Prior to the initia- 
tion of the evaluation, program administrators em- 
phasized that their primary concerns centered around 
the involvement of citizens in efforts to provide support 
to ex-offenders. It was not, however, until well after the
evaluation was under way that we began to suspect that
the VISTA and Paraprofessional units were something
less than full partners in program operations. Even
then, however, we did not suspect that such large
differentials in treatment were likely to exist. 

*It would have been possible, of course, to confront the treatment
implementation question “head-on” with a more complex design that
included treatment level as a design variable. To have done so, however,
would have increased the cost of the experiment to a prohibitive level. 
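
Whether implementation differed across the three Challenge groups is a test of independence on the group-by-level table. As a rough check, the sketch below computes a Pearson chi-square statistic in plain Python. The cell counts are not reported in the article; they are back-calculated here from the Table 2 percentages and group sizes and should be read as an assumption. The function name is illustrative.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: Not Implemented, Level 1, Level 2, Level 3, Level 4.
# Columns: Citizen Volunteer (n=106), VISTA (n=109), Paraprofessional (n=46).
# Counts are ASSUMED: rounded back from the published percentages.
table2_counts = [
    [45, 43, 20],
    [20, 25, 15],
    [6, 15, 7],
    [1, 2, 0],
    [34, 24, 4],
]

CRITICAL_05_DF8 = 15.507  # chi-square critical value for alpha = .05 with 8 df

stat, df = pearson_chi_square(table2_counts)
print(f"chi-square = {stat:.2f}, df = {df}")
print("exceeds .05 critical value:", stat > CRITICAL_05_DF8)
```

With these counts the statistic comes out close to the value reported under Table 2. Several expected cell counts are small (the Level 3 row holds only three clients), so the chi-square approximation deserves some caution.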

As will be recalled from Table 1, the original design 
called for an examination of the impact of treatment 
alternatives across several outcome areas. For the pur- 
pose of this paper, we have selected only one area (job 
acquisition) for examination. Our basic findings are 
shown in Table 3. There, the percentages of clients ac- 
tually placed in jobs are shown by treatment group. The 
findings are unambiguous. The hypothesis that treat- 
ment alternatives had no effect cannot be rejected. In 
addition, a cost-effectiveness analysis (not shown here) 
indicated that the Challenge operations were 
significantly more expensive than DARE when costs- 
per-placed client were considered (Cook, Dobson, & 
Rezmovic, 1980). On the basis of these data, the thrust 
of recommendations which one might make is clear: the 
treatments do not appear to have any discernable ef- 
fects relative to the DARE group. 

TABLE 2 
TREATMENT IMPLEMENTATION BY GROUP* 
(N = 261) 

                                      Group
                     Staff Associate
Treatment Level      (Citizen Volunteer)    VISTA    Paraprofessional
Not Implemented             42.5             39.4          43.5
Level 1                     18.9             22.9          32.6
Level 2                      5.7             13.8          15.2
Level 3                      0.9              1.8           0.0
Level 4                     32.1             22.0           8.7
Total                      100.0            100.0         100.0
(n)                        (106)            (109)          (46)

*χ² = 15.78; p = .045. 


TABLE 3 
TREATMENT GROUP AND EMPLOYMENT OUTCOMES: JOB ACQUISITION* 

                                   Group
Job Status     DARE    Citizen Volunteer    VISTA    Paraprofessional
Not Placed     51.9          49.1            52.3          52.2
Placed         48.1          50.9            47.7          47.8
Total         100.0         100.0           100.0         100.0
(n)           (106)         (106)           (109)          (46)

*χ² = .28; p = .96 


We know, however, albeit ex post facto, that the
treatment was not fully implemented and, moreover, 
that there were significant differences in levels of im- 
plementation across the treatment alternatives. Thus, 
one might reasonably argue that the results shown in 
Table 3 do not accurately portray the effects of treat- 
ment. Clearly the data show the effects of assignment to 
the treatment groups, but each group combines a mix- 
ture of both treated and un-treated clients. 

When levels of treatment imposition are brought into 
the analysis, a dramatically different picture emerges. 
Table 4 shows job acquisition rates for DARE (the 
“control” group), and for each level of treatment im- 
position in Challenge. Differences are quite marked. 
Among the DARE clients, 48.1% were placed in jobs. 
Among Challenge clients for whom treatment was not 
implemented, the percentage placed was more than 10
percentage points lower (37.0%). Thereafter, passage of each successive 
treatment criterion yields a higher probability of job 
placement. Among Challenge clients for whom treat- 
ment was fully implemented (by “relaxed” criteria!), 
over two-thirds (67.7%) were actually placed in jobs. 
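
The contrast between the group-level and level-specific results is a matter of simple pooling: the placement rate for everyone assigned to Challenge is the average of the level-specific rates, weighted by how many clients reached each level. The sketch below works through that arithmetic. The placed counts are back-calculated from the Table 4 percentages and are an assumption (rounding may put them off by a client), and the variable names are illustrative.

```python
# (placed, n) by implementation level; placed counts are ASSUMED,
# back-calculated from the published Table 4 percentages.
challenge_levels = {
    "Not implemented": (40, 108),
    "Level 1": (29, 60),
    "Level 2": (16, 28),
    "Level 3": (2, 3),
    "Level 4": (42, 62),
}
dare_placed, dare_n = 51, 106


def percent(placed, n):
    """Placement rate as a percentage."""
    return 100.0 * placed / n


total_n = sum(n for _, n in challenge_levels.values())
total_placed = sum(placed for placed, _ in challenge_levels.values())

# Weighted average of level-specific rates; weight = share of clients per level.
weighted = sum(percent(p, n) * n / total_n for p, n in challenge_levels.values())

print(f"DARE:                  {percent(dare_placed, dare_n):.1f}%")
print(f"Challenge, all levels: {percent(total_placed, total_n):.1f}%")
print(f"Challenge, Level 4:    {percent(*challenge_levels['Level 4']):.1f}%")
```

Because 108 of the 261 assigned clients fall in the lowest-rate category, the pooled Challenge rate lands within a point or two of the DARE rate even though the Level 4 rate is far higher.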

Further, if we compare experimental clients who ac- 
tually received treatment to the control group (Table 5), 
it is clear that there are significant differences between 
the control and experimental conditions. For clients 
who received treatment the lowest placement rate was 
62.5% in the VISTA group. For both of the remaining 
groups, placement rates exceeded 70% (see author’s 
note). 


The findings raise a complex set of questions which, 
unfortunately, cannot be fully resolved within the con- 
text of the present design. The evaluation was conceived 
to examine the impact of treatment alternatives. The 
possibility that treatment might be delivered at dif- 
ferential levels, while recognized at the outset of the 
study, was not explicitly incorporated into the ex- 
perimental design through the use of treatment im- 
plementation as a design variable. Instead, we collected 
data on the actual treatment received by clients for use 
in the analysis of outcomes. 

TABLE 4 
JOB ACQUISITION AND TREATMENT IMPLEMENTATION* 

                                    Treatment Level
Job Status     DARE    Not Implemented    Level 1    Level 2    Level 3    Level 4
Not Placed     51.9         63.0           51.7       42.9       33.3       32.3
Placed         48.1         37.0           48.3       57.1       66.7       67.7
Total         100.0        100.0          100.0      100.0      100.0      100.0
(n)           (106)        (108)           (60)       (28)        (3)       (62)

*χ² = 16.06; p = .007 


TABLE 5 
TREATMENT GROUP AND JOB ACQUISITION* 
(“Treated” Experimental Clients Only) 

                                   Group
Job Status     DARE    Citizen Volunteer    VISTA    Paraprofessional
Not Placed     51.9          29.4            37.5          25.0
Placed         48.1          70.6            62.5          75.0
Total         100.0         100.0           100.0         100.0
(n)           (106)          (34)            (24)           (4)

*χ² = 12.56; p < .01 


When we examined the hypotheses that the design
was intended to test, we found null results. Yet positive
results were observed when the integrity of treatments
was considered. The fundamental dilemma, of course,
is that ex post facto consideration of treatment
implementation vitiates the integrity of the original
design. Certainly, the analysis presented above is
suggestive of positive treatment effects, but there are
competing explanations which are quite difficult to
discount. Probably the most serious of these relates to
potential selection biases which may have operated. We
have already noted, for instance, that program
administrators appeared to be substantially more
concerned with the Citizen Volunteer treatment alternative
than with either the VISTA or Paraprofessional
alternatives. Moreover, and perhaps partially due to the
focus of administrative concerns, we observed
significant staff turnover in both the VISTA and
Paraprofessional units during the early phases of the experiment.
There is little doubt that such administrative difficulties
led to delays in making contact with some VISTA and
Paraprofessional clients. Such delays may have had a
direct effect on client attrition from treatment.
Similarly, clients may have self-selected within
treatment alternatives. It is quite conceivable, for example,
that the “most difficult” cases (e.g., clients with drug
problems, intense personal problems, or continued
involvement in criminal activities) withdrew from the
program of their own volition, leaving clients with
higher probabilities of ultimate placement. It is also
exceedingly likely that program staff had great difficulty
in establishing and maintaining contact with such
clients. Whatever the processes, the essential point is
that selection, at the very least, constitutes a major
threat vis-a-vis the observed effects of treatment
implementation. 

For program evaluators charged with arriving at
some prescriptive conclusions, these results open
Pandora’s box. True, the treatment alternatives had no
apparent effect. We set alpha in the neighborhood of .05
and, based on that decision rule, we are confident that a
Type I error has not occurred with respect to this
performance area. But the analysis of implementation data
lays open a significant likelihood of committing
a Type III error, i.e., evaluating a “phantom” program. 
In particular, the possibility that the treatment, when 
appropriately implemented, does have positive effects 
must be squarely faced. To do less is quite likely to deny 
future funding to a program which, if properly im- 
plemented, may be quite effective. 


SUMMARY AND CONCLUSIONS 


In this article, we described an evaluation research
design regarded by all parties as sufficiently strong to
evaluate the outcomes of a large-scale ex-offender
program. We presented results suggesting that analyses
conducted within the framework of the original design lead
to null results. Yet, when the question of program
implementation was considered, doubt was cast on the
primary findings. In instances where the program was
implemented, there was evidence suggesting positive
outcomes. 


Practically speaking, as researchers who must face
the task of making policy recommendations, our options are
rather limited: we are caught in the grey mist of “maybe 
yes, maybe no.” Perhaps the most instructive conclu- 
sion is that the evidence warrants re-evaluation of the 
program after substantial effort is made to “clean up” 
implementation problems and ensure the integrity of 
treatment delivery. 

In any event, the lesson here is clear. In social service
programs it is absolutely essential that the issue of
treatment integrity be examined prior to the initiation of any
evaluation effort. In fact, our experience leads us to
strongly concur with the thrust of recent pleas that
evaluation proceed in stages. At a minimum,
implementation parameters must be known prior to the assessment
of program outcomes to safeguard against Type III
errors. If feasible, expected treatment implementation
levels should be included as design variables in the
evaluation. 


AUTHOR’S NOTE: In considering these results it should be understood clearly that the DARE group was not a
“pure” control group; a number of services were offered through DARE. In analyses for her dissertation our
colleague, Eva Rezmovic, has suggested that the DARE Program, as well as the Challenge Program, was probably
implemented differentially; hence, not all of the DARE group received the same level of service. While such
differential implementation of DARE may affect job acquisition rates for both Challenge and DARE clients, since both groups
received DARE services, this does not gainsay the argument presented in this paper, namely, that levels of treatment
exposure need to be included in the analysis of evaluation results. The results of analyzing treatment levels for the
DARE control group will be reported in the future. 


REFERENCES 


BORUCH, R., & RIECKEN, H. Social experiments: A method for 
planning and evaluating social programs. New York: Academic 
Press, 1974. 


CAMPBELL, D.T., & STANLEY, J.C. Experimental and quasi- 
experimental designs for research. Chicago: Rand McNally, 1966. 


COOK, T.D., & CAMPBELL, D.T. The design and analysis of quasi- 
experiments and true experiments in field settings. In M.D. Dunnette & 
J.P. Campbell (Eds.), Handbook of industrial organizational 
research, Chicago: Rand McNally, 1975. 


COOK, T.J., DOBSON, D., & REZMOVIC, E.L. Working with ex- 
offenders: The challenge experiment. Research Triangle Park, N.C.: 
Research Triangle Institute, 1980. 


DOBSON, D., & COOK, T.J. Implementing random assignment: A 
computer-based approach in a field experimental setting. Evaluation 
Quarterly, August 1979, 3(3), 472-489. 


FAIRWEATHER, G.W., & TORNATZKY, L.G. Experimental 
methods for social policy research, New York: Pergamon, 1977. 


PATTON, M. Evaluation of program implementation. In M.Q. Pat- 
ton (Ed.), Utilization-focused evaluation. Beverly Hills, Calif.: Sage 
Publications, 1978. 


QUAY, H.C. The three faces of evaluation: What can be expected to 
work. Criminal Justice and Behavior, 1977, 4(4), 341-354. 


REZMOVIC, E.L., COOK, T.J., & DOBSON, D. Beyond random 
assignment: Factors affecting evaluation integrity. Evaluation Re- 
view, in press. 


ROSSI, P.H. Issues in the evaluation of human services delivery. 
Evaluation Quarterly, 1978, 2(4), 573-599. 


ROSSI, P.H. et al. Evaluation: A systematic approach. Beverly Hills, 
Calif.: Sage Publications, 1979. 


SCANLON, J.W. et al. Evaluability assessment: Avoiding Type III 
or IV errors. In G.R. Gilbert & P.J. Conklin (Eds.), Evaluation 
Management: A source book of readings. Charlottesville: U.S. Civil 
Service Commission, 1977. 


SECHREST, L. et al. Some neglected problems in evaluation re- 
search: Strength and integrity of treatments. Evaluation Studies 
Review Annual. Beverly Hills, Calif.: Sage Publications, 1979. 


